Skip to content

Conversation

@achimnol
Copy link
Member

@achimnol achimnol commented Sep 27, 2025

resolves #5926 (BA-2404)

Core changes:

image image
  • Rewrite SlotName to be a UserString subclass which lazily parses the slot name format.
    • {device_name}.{major_type}[:{minor_type}]
  • Update gpu_allocated calculation to use SlotName.is_accelerator() method to sum all allocated accelerators
    • Centralize the logic to guess if the slot name is an accelerator or not, for the future.
  • Update ResourceSlot to use UserDict with proper generic type arguments

Warning

When there are multiple different accelerators allocated in a single container or installed in a single agent node, this simple "sum" for gpu_allocated may have a non-sense value when mixing different units and fraction scales. We need some design discussion here.
This is kept as a known issue because the previous implementation was same in this sense.

Checklist: (if applicable)

  • Mention to the original issue
  • Test cases for:
    • New SlotName parsing implementation

📚 Documentation preview 📚: https://sorna--6087.org.readthedocs.build/en/6087/


📚 Documentation preview 📚: https://sorna-ko--6087.org.readthedocs.build/ko/6087/

@github-actions github-actions bot added size:XL 500~ LoC comp:manager Related to Manager component comp:agent Related to Agent component comp:common Related to Common component size:L 100~500 LoC and removed size:XL 500~ LoC labels Sep 27, 2025
@github-actions github-actions bot added the area:docs Documentations label Sep 29, 2025
@achimnol achimnol force-pushed the fix/generalize-gpu-allocated-stat-field branch from dbb23e9 to 4aab698 Compare October 29, 2025 06:07
@achimnol achimnol marked this pull request as ready for review October 29, 2025 08:38
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the SlotName type from a simple NewType to a UserString class with parsing capabilities, and updates ResourceSlot to properly handle string keys and normalize values. The changes improve type safety and consistency in GPU resource allocation tracking across different accelerator types (CUDA devices, shares, MIG variants, and NPUs).

Key Changes

  • Converted SlotName from NewType to a UserString class with lazy parsing and accelerator detection capabilities
  • Enhanced ResourceSlot to properly normalize keys to strings and process raw values (int, float, str, Decimal, BinarySize) into Decimal
  • Updated GPU allocation tracking to use generic accelerator detection instead of hardcoded CUDA-specific slot names
  • Added serialization support for SlotName and improved JSON encoding for test fixtures

Reviewed Changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
src/ai/backend/common/types.py Core refactoring: SlotName class implementation and ResourceSlot normalization improvements
src/ai/backend/common/msgpack.py Added msgpack serialization support for SlotName
src/ai/backend/common/utils.py Added debugging utility pprint_with_type for type inspection
src/ai/backend/common/resilience/policies/retry.py Added default non-retryable exceptions to retry policy
src/ai/backend/manager/models/resource_usage.py Updated GPU allocation to use SlotName.is_accelerator()
src/ai/backend/manager/repositories/user/repository.py Updated GPU stats aggregation to sum all accelerator slots
src/ai/backend/manager/repositories/group/repository.py Updated container stats to use generic accelerator detection
src/ai/backend/manager/repositories/resource_preset/repository.py Added str() conversion for dict keys
src/ai/backend/manager/models/image.py Added str() conversion for resource limit keys
src/ai/backend/manager/api/etcd.py Added str() conversion for API response keys
src/ai/backend/manager/idle.py Improved variable naming from val to slot_val/util_val
src/ai/backend/manager/sokovan/scheduler/selectors/concentrated.py Added type hint for sort_key list
src/ai/backend/agent/resources.py Updated to use ResourceSlot type and SlotName class
src/ai/backend/agent/alloc_map.py Added str() conversions for fnmatch operations
src/ai/backend/agent/docker/agent.py Changed from dict unpacking to slots.copy()
src/ai/backend/agent/dummy/agent.py Changed from dict unpacking to slots.copy()
src/ai/backend/agent/kubernetes/agent.py Changed from dict unpacking to slots.copy()
src/ai/backend/agent/stage/kernel_lifecycle/docker/resource.py Changed from dict unpacking to slots.copy()
src/ai/backend/accelerator/cuda_open/plugin.py Added str() conversion for slot_name in metadata
tests/manager/conftest.py Enhanced fixture_json_encoder to handle ResourceSlot and Decimal
tests/manager/repositories/keypair_resource_policy/test_keypair_resource_policy.py Fixed test data to use string values for resource slots
tests/common/test_types.py Added comprehensive tests for SlotName parsing and BinarySize/Decimal conversion
changes/2404.fix.md Added changelog entry

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +50 to +54
DEFAULT_NON_RETRYABLE_ERRORS: Final = (
TypeError,
NameError,
AttributeError,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be provided publicly?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:agent Related to Agent component comp:common Related to Common component comp:manager Related to Manager component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

gpu_allocated field should include all accelerators

3 participants